Client Report - Late Flights & Missing Data (JSON)

Course DS 250

Author

Travis Wheeler

Show the code
import pandas as pd
import numpy as np
from lets_plot import *

LetsPlot.setup_html(isolated_frame=True)
Show the code
# Learn morea about Code Cells: https://quarto.org/docs/reference/cells/cells-jupyter.html

# Include and execute your code here
df = pd.read_json("https://github.com/byuidatascience/data4missing/raw/master/data-raw/flights_missing/flights_missing.json")

Elevator pitch

This project analyzed flight delay data to identify patterns in airport performance and delay causes. SFO had the highest overall delay rate, and the best months to fly with fewer delays were September through November. We also calculated a new metric for total weather-related delays to better understand their impact.

QUESTION|TASK 1

Fix all of the varied missing data types in the data to be consistent (all missing values should be displayed as “NaN”). In your report include one record example (one row) from your new data, in the raw JSON format. Your example should display the “NaN” for at least one missing value.__

We replaced missing values with consistent NaN values using df.replace(). Below is one example row showing the airport_name missing value represented as NaN.

Show the code
# Include and execute your code here
missing_values = ["", " ", "NA", "NULL", "null", ".", -999]
df.replace(missing_values, np.nan, inplace=True)

print(df.iloc[2])
airport_code                         IAD
airport_name                         NaN
month                            January
year                              2005.0
num_of_flights_total               12381
num_of_delays_carrier                414
num_of_delays_late_aircraft       1058.0
num_of_delays_nas                    895
num_of_delays_security                 4
num_of_delays_weather                 61
num_of_delays_total                 2430
minutes_delayed_carrier              NaN
minutes_delayed_late_aircraft      70919
minutes_delayed_nas              35660.0
minutes_delayed_security             208
minutes_delayed_weather             4497
minutes_delayed_total             134881
Name: 2, dtype: object

QUESTION|TASK 2

Which airport has the worst delays? Describe the metric you chose, and why you chose it to determine the “worst” airport. Your answer should include a summary table that lists (for each airport) the total number of flights, total number of delayed flights, proportion of delayed flights, and average delay time in hours.

ORD has the worst delays. I used delay_proportion and avg_delay_hours to compare airports. ORD had the highest avg_delay_hours (7115.67) and second highest delay_proportion (0.23).

Show the code
# Include and execute your code here
df_clean = df.dropna(subset=[
    "num_of_flights_total",
    "num_of_delays_total",
    "minutes_delayed_total"
])

# Convert delay columns to numeric just in case any were stored as strings
df_clean["num_of_flights_total"] = pd.to_numeric(df_clean["num_of_flights_total"])
df_clean["num_of_delays_total"] = pd.to_numeric(df_clean["num_of_delays_total"])
df_clean["minutes_delayed_total"] = pd.to_numeric(df_clean["minutes_delayed_total"])

# Group by airport and calculate summary metrics
summary = df_clean.groupby("airport_code").agg(
    total_flights=("num_of_flights_total", "sum"),
    total_delays=("num_of_delays_total", "sum"),
    avg_delay_minutes=("minutes_delayed_total", "mean")
).reset_index()

# Calculate proportions and convert minutes to hours
summary["delay_proportion"] = summary["total_delays"] / summary["total_flights"]
summary["avg_delay_hours"] = summary["avg_delay_minutes"] / 60

# Display summary table with selected columns
summary[["airport_code", "total_flights", "total_delays", "delay_proportion", "avg_delay_hours"]].round(2)
airport_code total_flights total_delays delay_proportion avg_delay_hours
0 ATL 4430047 902443 0.20 6816.15
1 DEN 2513974 468519 0.19 3178.46
2 IAD 851571 168467 0.20 1298.42
3 ORD 3597588 830825 0.23 7115.67
4 SAN 917862 175132 0.19 1044.98
5 SFO 1630945 425604 0.26 3352.33
6 SLC 1403384 205160 0.15 1278.20

QUESTION|TASK 3

What is the best month to fly if you want to avoid delays of any length? Describe the metric you chose and why you chose it to calculate your answer. Include one chart to help support your answer, with the x-axis ordered by month. (To answer this question, you will need to remove any rows that are missing the Month variable.)

I used delay_proportion to show that the best months to avoid delays are September, October, and November.

Show the code
# Include and execute your code here
# Filter out rows where 'month' is missing
df_month = df[df['month'].notna()]

# Group by month
monthly_stats = df_month.groupby('month').agg({
    'num_of_delays_total': 'sum',
    'num_of_flights_total': 'sum'
}).reset_index()

# Calculate delay proportion
monthly_stats['delay_proportion'] = monthly_stats['num_of_delays_total'] / monthly_stats['num_of_flights_total']

# Order months (not alphabetically)
month_order = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December"
]
monthly_stats = monthly_stats[monthly_stats['month'].isin(month_order)]
monthly_stats['month'] = pd.Categorical(monthly_stats['month'], categories=month_order, ordered=True)
monthly_stats = monthly_stats.sort_values('month')
monthly_stats['month'] = monthly_stats['month'].astype(str)  # convert to string to avoid casting issues

# Line chart
ggplot(monthly_stats, aes(x='month', y='delay_proportion')) + \
    geom_line(size=1.5) + \
    geom_point(size=3) + \
    ggtitle('Proportion of Flight Delays by Month') + \
    xlab('Month') + \
    ylab('Proportion of Delayed Flights') + \
    theme(axis_text_x=element_text(angle=45, hjust=1))

QUESTION|TASK 4

According to the BTS website, the “Weather” category only accounts for severe weather delays. Mild weather delays are not counted in the “Weather” category, but are actually included in both the “NAS” and “Late-Arriving Aircraft” categories. Your job is to create a new column that calculates the total number of flights delayed by weather (both severe and mild). You will need to replace all the missing values in the Late Aircraft variable with the mean. Show your work by printing the first 5 rows of data in a table. Use these three rules for your calculations:

a. 100% of delayed flights in the Weather category are due to weather  
a. 30% of all delayed flights in the Late-Arriving category are due to weather  
a. From April to August, 40% of delayed flights in the NAS category are due to weather. The rest of the months, the proportion rises to 65%    

January had the most delays due to weather. ORD having the highest total delays due to weather.

Show the code
# Include and execute your code here
late_aircraft_mean = df['num_of_delays_late_aircraft'].mean()
df['num_of_delays_late_aircraft'].fillna(late_aircraft_mean, inplace=True)
Show the code
# Include and execute your code here
# a
weather_severe = df['num_of_delays_weather']

# b
weather_late = df['num_of_delays_late_aircraft'] * 0.30

# c
weather_months = ['April', 'May', 'June', 'July', 'August']
weather_nas = np.where(
    df['month'].isin(weather_months),
    df['num_of_delays_nas'] * 0.40,
    df['num_of_delays_nas'] * 0.65
)
Show the code
# Include and execute your code here
df['weather_total'] = weather_severe + weather_late + weather_nas

# First five rows
df[['airport_code', 'month', 'num_of_delays_weather', 'num_of_delays_late_aircraft', 'num_of_delays_nas', 'weather_total']].head()
airport_code month num_of_delays_weather num_of_delays_late_aircraft num_of_delays_nas weather_total
0 ATL January 448 1109.104072 4598 3769.431222
1 DEN January 233 928.000000 935 1119.150000
2 IAD January 61 1058.000000 895 960.150000
3 ORD January 306 2255.000000 5415 4502.250000
4 SAN January 56 680.000000 638 674.700000

QUESTION|TASK 5

Using the new weather variable calculated above, create a barplot showing the proportion of all flights that are delayed by weather at each airport. Describe what you learn from this graph.

SFO had the highest proportion of weather-delayed flights, followed by ORD.

Show the code
# Include and execute your code here

# Calculate total weather delay proportion for each airport
df_weather = df.groupby('airport_code').agg({
    'weather_total': 'sum',
    'num_of_flights_total': 'sum'
}).reset_index()

df_weather['weather_delay_proportion'] = df_weather['weather_total'] / df_weather['num_of_flights_total']

# Create a barplot
ggplot(df_weather, aes(x='airport_code', y='weather_delay_proportion')) + \
    geom_bar(stat='identity') + \
    labs(
        title='Proportion of Flights Delayed by Weather per Airport',
        x='Airport',
        y='Weather Delay Proportion'
    )

STRETCH QUESTION|TASK 1

Which delay is the worst delay? Create a similar analysis as above for Weahter Delay with: Carrier Delay and Security Delay. Compare the proportion of delay for each of the three categories in a Chart and a Table. Describe your results.

type your results and analysis here

Show the code
# Include and execute your code here